Abstract

We address the problem of reinforcement learning in which observations 
may exhibit an arbitrary form of stochastic dependence on past observations 
and actions, i.e. environments more general than (PO)MDPs. The task for 
an agent is to attain the best possible asymptotic reward where the true gen- 
erating environment is unknown but belongs to a known countable family of 
environments. We find some sufficient conditions on the class of environments 
under which an agent exists which attains the best asymptotic reward for any 
environment in the class. We analyze how tight these conditions are and 
how they relate to different probabilistic assumptions known in reinforcement 
learning and related fields, such as Markov Decision Processes and mixing 
conditions.

1 Introduction



Many real-world "learning" problems (like learning to drive a car or playing a game)
can be modelled as an agent $\pi$ that interacts with an environment $\mu$ and is
(occasionally) rewarded for its behavior. We are interested in agents which perform well
in the sense of having high long-term reward, also called the value $V(\mu,\pi)$ of agent
$\pi$ in environment $\mu$. If $\mu$ is known, it is a pure (non-learning) computational
problem to determine the optimal agent $\pi^\mu := \arg\max_\pi V(\mu,\pi)$. It is far less
clear what an "optimal" agent means if $\mu$ is unknown. A reasonable objective is to have
a single policy $\pi$ with high value simultaneously in many environments. We will
formalize this criterion later and call it self-optimizing.

Learning approaches in reactive worlds. Reinforcement learning, sequential
decision theory, adaptive control theory, and active expert advice are theories
dealing with this problem. They overlap but have different core focus: Reinforcement
learning algorithms [SB98] are developed to learn $\mu$ or directly its value. Temporal
difference learning is computationally very efficient, but has slow asymptotic
guarantees (only) in (effectively) small observable MDPs. Others have faster guarantees
in finite state MDPs [BT99]. There are algorithms [EDKM05] which are optimal
for any finite connected POMDP, and this is apparently the largest class of
environments considered. In sequential decision theory, a Bayes-optimal agent $\pi^*$ that
maximizes $V(\xi,\pi)$ is considered, where $\xi$ is a mixture of environments $\nu \in \mathcal{C}$ and
$\mathcal{C}$ is a class of environments that contains the true environment $\mu \in \mathcal{C}$ [Hut05].
Policy $\pi^*$ is self-optimizing in an arbitrary (e.g. non-POMDP) class $\mathcal{C}$, provided
$\mathcal{C}$ allows for self-optimizingness [Hut02]. Adaptive control theory [KV86] considers
very simple (from an AI perspective) or special systems (e.g. linear with quadratic loss
function), which sometimes allow computationally and data efficient solutions. Action
with expert advice [dFM04, PH05, PH06, CBL06] constructs an agent (called master) that
performs nearly as well as the best agent (best expert in hindsight) from some class
of experts, in any environment $\nu$. The important special case of passive sequence
prediction in arbitrary unknown environments, where the actions (the predictions) do
not affect the environment, is comparably easy [Hut03, HP04].

The difficulty in active learning problems can be identified (at least, for countable
classes) with traps in the environments. Initially the agent does not know $\mu$, so it
must asymptotically be forgiven for taking initial "wrong" actions. A well-studied
class of this kind is that of ergodic MDPs, which guarantee that, from any action
history, every state can be (re)visited [Hut02].

What's new. The aim of this paper is to characterize classes $\mathcal{C}$, as general as
possible and more general than POMDPs, in which self-optimizing behaviour is possible.
To do this we need to characterize classes of environments that forgive. For instance,
exact state recovery is unnecessarily strong; it is sufficient to be able to recover high
rewards, from whatever states. Further, in many real-world problems there is no
information available about the "states" of the environment (e.g. in POMDPs) or
the environment may exhibit long history dependencies.

Rather than trying to model an environment (e.g. by MDP) we try to identify 
the conditions sufficient for learning. Towards this aim, we propose to consider only 
environments in which, after any arbitrary finite sequence of actions, the best value is 
still achievable. The performance criterion here is asymptotic average reward. Thus 
we consider such environments for which there exists a policy whose asymptotic 
average reward exists and upper-bounds asymptotic average reward of any other 
policy. Moreover, the same property should hold after any finite sequence of actions 
has been taken (no traps). We call such environments recoverable. If we only want
to get $\varepsilon$-close to the optimal value infinitely often with decreasing $\varepsilon$ (that is,
to have the same upper limit for the average value), then this property is already sufficient.

Yet recoverability in itself is not sufficient for identifying behaviour which results 
in optimal limiting average value. We require further that, from any sequence of k 
actions, it is possible to return to the optimal level of reward in o(k) steps; that is, it 
is not just possible to recover after any sequence of (wrong) actions, but it is possible 
to recover fast. Environments which possess this property are called value-stable. 
(These conditions will be formulated in a probabilistic form.) 

We show that for any countable class of value-stable environments there exists 
a policy which achieves the best possible value in any of the environments from the 
class (i.e. is self-optimizing for this class).

Furthermore, we present some examples of environments which possess value- 
stability and/or recoverability. In particular, any ergodic MDP can be easily shown 
to be value-stable. A mixing-type condition which implies value-stability is also 
demonstrated. In addition, we provide a construction that yields examples of
value-stable and/or recoverable environments which are not isomorphic to a finite
POMDP, thus demonstrating that the class of value-stable environments is quite
general.

Finally, we consider environments which are not recoverable but still are value- 
stable. In other words, we consider the question of what it means to be optimal in 
an environment which does not "forgive" wrong actions. Even in such cases some 
policies are better than others, and we identify some conditions which are sufficient 
for learning a policy that is optimal from some point on. 

It is important in our argument that the class of environments for which we seek 
a self-optimizing policy is countable, although the class of all value-stable environ- 
ments is uncountable. To find a set of conditions necessary and sufficient for learning 
which do not rely on countability of the class is yet an open problem. However, from 
a computational perspective countable classes are sufficiently large (e.g. the class of 
all computable probability measures is countable). 

Contents. The paper is organized as follows. Section 2 introduces the necessary
notation of the agent framework. In Section 3 we define and explain the notion of
value-stability, which is central to the paper, and a weaker but simpler notion of
recoverability. Section 4 presents the theorems about self-optimizing policies for
classes of value-stable environments and recoverable environments. In Section 5 we
discuss what can be achieved if the environments are not recoverable. Section 6
illustrates the applicability of the theorems by providing examples of value-stable
and recoverable environments. In Section 7 we discuss the necessity of the conditions of
the main theorems. Section 8 provides some discussion of the results and an outlook
to future research. Formal proofs of the main theorems are given in Section A, while
Sections 4 and 5 contain only intuitive explanations.

2 Notation and Definitions 

We essentially follow the notation of [Hut02, Hut05].

Strings and probabilities. We use letters $i,k,l,m,n \in \mathbb{N}$ for natural numbers,
and denote the cardinality of a set $S$ by $\#S$. We write $\mathcal{X}^*$ for the set of finite
strings over some alphabet $\mathcal{X}$, and $\mathcal{X}^\infty$ for the set of infinite sequences.
For a string $x \in \mathcal{X}^*$ of length $\ell(x) = n$ we write $x_1 x_2 \ldots x_n$ with
$x_t \in \mathcal{X}$ and further abbreviate $x_{k:n} := x_k x_{k+1} \ldots x_{n-1} x_n$ and
$x_{<n} := x_1 \ldots x_{n-1}$. Finally, we define $x_{k..n} := x_k + \ldots + x_n$,
provided elements of $\mathcal{X}$ can be added.

We assume that the sequence $\omega = \omega_{1:\infty} \in \mathcal{X}^\infty$ is sampled from the
"true" probability measure $\mu$, i.e. $P[\omega_{1:n} = x_{1:n}] = \mu(x_{1:n})$. We denote
expectations w.r.t. $\mu$ by $E$, i.e. for a function $f : \mathcal{X}^n \to \mathbb{R}$,
$E[f] = E[f(\omega_{1:n})] = \sum_{x_{1:n}} \mu(x_{1:n}) f(x_{1:n})$. When we use
probabilities and expectations with respect to other measures we make the notation
explicit, e.g. $E_\nu$ is the expectation with respect to $\nu$. Measures $\nu_1$ and $\nu_2$
are called singular if there exists a set $A$ such that $\nu_1(A) = 0$ and $\nu_2(A) = 1$.

The agent framework is general enough to allow modelling nearly any kind of
(intelligent) system [RN95]. In cycle $k$, an agent performs action $y_k \in \mathcal{Y}$ (output)
which results in observation $o_k \in \mathcal{O}$ and reward $r_k \in \mathcal{R}$, followed by cycle
$k+1$ and so on. We assume that the action space $\mathcal{Y}$, the observation space $\mathcal{O}$,
and the reward space $\mathcal{R} \subset \mathbb{R}$ are finite, w.l.g. $\mathcal{R} = \{0,\ldots,r_{\max}\}$.
We abbreviate $z_k := y_k r_k o_k \in \mathcal{Z} := \mathcal{Y} \times \mathcal{R} \times \mathcal{O}$
and $x_k := r_k o_k \in \mathcal{X} := \mathcal{R} \times \mathcal{O}$. An agent is identified with a
(probabilistic) policy $\pi$. Given history $z_{<k}$, the probability that agent $\pi$ acts
$y_k$ in cycle $k$ is (by definition) $\pi(y_k|z_{<k})$. Thereafter, environment $\mu$ provides
(probabilistic) reward $r_k$ and observation $o_k$, i.e. the probability that the agent
perceives $x_k$ is (by definition) $\mu(x_k|z_{<k}y_k)$. Note that the policy and the
environment are allowed to depend on the complete history. We do not make any MDP or
POMDP assumption here, and we don't talk about states of the environment, only about
observations. Each (policy, environment) pair $(\pi,\mu)$ generates an I/O sequence
$z_1^{\pi\mu} z_2^{\pi\mu} \ldots$. Mathematically, the history $z_{1:k}^{\pi\mu}$ is a random
variable with probability

$$P(z_{1:k}^{\pi\mu} = z_{1:k}) = \pi(y_1) \cdot \mu(x_1|y_1) \cdot \ldots \cdot \pi(y_k|z_{<k}) \cdot \mu(x_k|z_{<k}y_k).$$
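
The product formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `history_probability`, the two-action uniform policy and the toy environment `toy_env` are hypothetical and do not come from the paper. The sketch shows that both the policy and the environment may look at the whole past history.

```python
from typing import Callable, Sequence, Tuple

Percept = Tuple[int, str]      # x_k = (reward r_k, observation o_k)
Cycle = Tuple[str, Percept]    # z_k = (action y_k, percept x_k)


def history_probability(
    policy: Callable[[str, Sequence[Cycle]], float],
    env: Callable[[Percept, Sequence[Cycle], str], float],
    history: Sequence[Cycle],
) -> float:
    """Multiply pi(y_k | z_<k) * mu(x_k | z_<k y_k) over all cycles k."""
    prob = 1.0
    for k, (action, percept) in enumerate(history):
        past = history[:k]  # z_<k, the complete history before cycle k
        prob *= policy(action, past) * env(percept, past, action)
    return prob


def uniform_policy(action, past):
    """Hypothetical policy: two actions "a" and "b", chosen uniformly."""
    return 0.5


def toy_env(percept, past, action):
    """Hypothetical environment whose reward law depends on the previous reward."""
    reward, _ = percept
    last_reward = past[-1][1][0] if past else 0
    p_one = 0.9 if (action == "a" and last_reward == 1) else 0.5
    return p_one if reward == 1 else 1.0 - p_one


history = [("a", (1, "")), ("a", (1, "")), ("b", (0, ""))]
print(history_probability(uniform_policy, toy_env, history))
```

Passing callables rather than tables keeps the sketch close to the text: no state space is assumed, only conditional probabilities given the history.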

Since value-maximizing policies can always be chosen deterministic, there is no real
need to consider probabilistic policies, and henceforth we consider deterministic
policies $p$. We assume that $\mu \in \mathcal{C}$ is the true, but unknown, environment, and
$\nu \in \mathcal{C}$ a generic environment.

3 Setup



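For an environment $\nu$ and a policy $p$ define the random variables (upper and lower
average value)

$$\overline{V}(\nu,p) := \limsup_m \left\{ \tfrac{1}{m} r^{p\nu}_{1..m} \right\} \quad\text{and}\quad \underline{V}(\nu,p) := \liminf_m \left\{ \tfrac{1}{m} r^{p\nu}_{1..m} \right\},$$

where $r_{1..m} := r_1 + \ldots + r_m$. If there exists a constant $\overline{V}$ or a constant
$\underline{V}$ such that

$$\overline{V}(\nu,p) = \overline{V} \text{ a.s.,} \quad\text{or}\quad \underline{V}(\nu,p) = \underline{V} \text{ a.s.,}$$

then we say that the upper limiting average or (respectively) lower average value
exists, and denote it by $\overline{V}(\nu,p) := \overline{V}$ (or $\underline{V}(\nu,p) := \underline{V}$). If both
upper and lower average limiting values exist and are equal then we simply say that the
average limiting value exists and denote it by
$V(\nu,p) := \overline{V}(\nu,p) = \underline{V}(\nu,p)$.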
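
To see why the upper and lower average values can differ, the following Python sketch computes running averages for a hypothetical reward sequence (not taken from the paper) made of blocks of doubling length that alternately pay 1 and 0. On a finite horizon the limsup and liminf can only be approximated, here by the maximum and minimum of the averages over a late tail.

```python
def running_averages(rewards):
    """Return the averages (1/m) * r_{1..m} for m = 1, ..., len(rewards)."""
    total = 0
    averages = []
    for m, reward in enumerate(rewards, start=1):
        total += reward
        averages.append(total / m)
    return averages


def block_rewards(num_blocks):
    """Hypothetical sequence: block j has length 2**j and pays 1 if j is even, else 0."""
    rewards = []
    for j in range(num_blocks):
        rewards.extend([1 - (j % 2)] * (2 ** j))
    return rewards


averages = running_averages(block_rewards(16))
tail = averages[1023:]        # averages for m >= 1024
print(max(tail), min(tail))   # finite-horizon stand-ins for limsup and liminf
```

For this sequence the running average keeps oscillating between roughly 2/3 and 1/3, so the upper value is about 2/3 while the lower value is about 1/3 and the average limiting value does not exist.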

An environment $\nu$ is explorable if there exists a policy $p_\nu$ such that $V(\nu,p_\nu)$
exists and $\overline{V}(\nu,p) \le V(\nu,p_\nu)$ with probability 1 for every policy $p$. In this
case define $V^*_\nu := V(\nu,p_\nu)$. An environment $\nu$ is upper explorable if there exists
a policy $p_\nu$ such that $\overline{V}(\nu,p_\nu)$ exists and
$\overline{V}(\nu,p) \le \overline{V}(\nu,p_\nu)$ with probability 1 for every policy $p$. In
this case define $\overline{V}^*_\nu := \overline{V}(\nu,p_\nu)$.

A policy $p$ is self-optimizing for a set of explorable environments $\mathcal{C}$ if
$V(\nu,p) = V^*_\nu$ for every $\nu \in \mathcal{C}$. A policy $p$ is upper self-optimizing for a set
of explorable environments $\mathcal{C}$ if $\overline{V}(\nu,p) = \overline{V}^*_\nu$ for every
$\nu \in \mathcal{C}$.

In the case when we wish to obtain the optimal average value for any environment
in the class we will speak about self-optimizing policies, whereas if we
are only interested in obtaining the upper limit of the average value then we will 
speak about upper self-optimizing policies. It turns out that the latter case is much 
simpler. The next two definitions present conditions on the environments which will 
be shown to be sufficient to achieve the two respective goals. 

Definition 1 (recoverable). We call an upper explorable environment $\nu$ recoverable
if for any history $z_{<k}$ such that $\nu(x_{<k}|y_{<k}) > 0$ there exists a policy $p$ such that

$$P(\overline{V}(\nu,p) = \overline{V}^*_\nu \mid z_{<k}) = 1.$$

Conditioning on the history $z_{<k}$ means that we take $\nu$-conditional probabilities
(conditional on $x_{<k}$) and the first $k-1$ actions of the policy $p$ are replaced by $y_{<k}$.

Recoverability means that after taking any finite sequence of (possibly sub- 
optimal) actions it is still possible to obtain the same upper limiting average value 
as an optimal policy would obtain. The next definition is somewhat more complex. 

Definition 2 (value-stable environments). An explorable environment $\nu$ is
value-stable if there exist a sequence of numbers $r^\nu_i \in [0,r_{\max}]$ and two functions
$d_\nu(k,\varepsilon)$ and $\varphi_\nu(n,\varepsilon)$ such that $\frac{1}{n} r^\nu_{1..n} \to V^*_\nu$,
$d_\nu(k,\varepsilon) = o(k)$, $\sum_{n=1}^{\infty} \varphi_\nu(n,\varepsilon) < \infty$ for every fixed
$\varepsilon$, and for every $k$ and every history $z_{<k}$ there exists a policy
$p = p^{z_{<k}}_\nu$ such that

$$P\left( r^\nu_{k..k+n} - r^{p\nu}_{k..k+n} > d_\nu(k,\varepsilon) + n\varepsilon \mid z_{<k} \right) \le \varphi_\nu(n,\varepsilon). \qquad (1)$$

First of all, this condition means that the strong law of large numbers for rewards
holds uniformly over histories $z_{<k}$; the numbers $r^\nu_i$ here can be thought of as expected
rewards of an optimal policy. Furthermore, the environment is "forgiving" in the
following sense: from any (bad) sequence of $k$ actions it is possible (knowing the
environment) to recover up to $o(k)$ reward loss; to recover means to reach the level
of reward obtained by the optimal policy which from the beginning was taking only
optimal actions. That is, suppose that a person A has made $k$ possibly suboptimal
actions and after that "realized" what the true environment was and how to act
optimally in it. Suppose that a person B was from the beginning taking only optimal
actions. We want to compare the performance of A and B on the first $n$ steps after
step $k$. An environment is value-stable if A can catch up with B except for an $o(k)$
reward loss. The numbers $r^\nu_i$ can be thought of as expected rewards of B; A can catch
up with B up to the reward loss $d_\nu(k,\varepsilon)$, failing with probability at most
$\varphi_\nu(n,\varepsilon)$, where the latter does not depend on past actions and observations
(the law of large numbers holds uniformly).

Examples of value-stable environments will be considered in Section 6.

4 Main Results 

In this section we present the main self-optimizingness result along with an informal
explanation of its proof, and a result on upper self-optimizingness, which turns out
to have much simpler conditions.

Theorem 3 (value-stable $\Rightarrow$ self-optimizing). For any countable class $\mathcal{C}$ of
value-stable environments, there exists a policy which is self-optimizing for $\mathcal{C}$.

A formal proof is given in the appendix; here we give some intuitive justification.
Suppose that all environments in $\mathcal{C}$ are deterministic. We will construct a
self-optimizing policy $p$ as follows. Let $\nu^t$ be the first environment in $\mathcal{C}$. The
algorithm assumes that the true environment is $\nu^t$ and tries to get $\varepsilon$-close to its
optimal value for some (small) $\varepsilon$. This is called an exploitation part. If it succeeds,
it does some exploration as follows. It picks the first environment $\nu^e$ which has higher
average asymptotic value than $\nu^t$ ($V^*_{\nu^e} > V^*_{\nu^t}$) and tries to get $\varepsilon$-close
to this value acting optimally under $\nu^e$. If it cannot get close to the $\nu^e$-optimal value
then $\nu^e$ is not the true environment, and the next environment can be picked for
exploration (here we call "exploration" successive attempts to exploit an environment
which differs from the current hypothesis about the true environment and has a higher
average reward). If it can, then it switches to exploitation of $\nu^t$, exploits it until it
is $\varepsilon'$-close to $V^*_{\nu^t}$, $\varepsilon' < \varepsilon$, and switches to $\nu^e$ again, this time
trying to get $\varepsilon'$-close to $V^*_{\nu^e}$; and so on. This can happen only a finite number
of times if the true environment is $\nu^t$, since $V^*_{\nu^t} < V^*_{\nu^e}$. Thus after
exploration either $\nu^t$ or $\nu^e$ is found to be inconsistent with the current history. If
it is $\nu^e$ then just the next environment $\nu^e$ such that $V^*_{\nu^e} > V^*_{\nu^t}$
is picked for exploration. If it is $\nu^t$ then the first consistent environment is picked
for exploitation (and denoted $\nu^t$). This in turn can happen only a finite number of
times before the true environment $\nu$ is picked as $\nu^t$. After this, the algorithm still
continues its exploration attempts, but can always keep within $\varepsilon_k \to 0$ of the optimal
value. This is ensured by $d(k) = o(k)$.

The probabilistic case is somewhat more complicated since we cannot say
whether an environment is "consistent" with the current history. Instead we test
each environment for consistency as follows. Let $\xi$ be a mixture of all environments
in $\mathcal{C}$. Observe that together with some fixed policy each environment $\mu$ can be
considered as a measure on $\mathcal{Z}^\infty$. Moreover, it can be shown that (for any fixed policy)
the ratio $\nu(z_{<n})/\xi(z_{<n})$ is bounded away from zero if $\nu$ is the true environment
$\mu$ and tends to zero if $\nu$ is singular with $\mu$ (in fact, here singularity is a
probabilistic analogue of inconsistency). The exploration part of the algorithm ensures
that at least one of the environments $\nu^t$ and $\nu^e$ is singular with $\nu$ on the current
history, and a succession of tests $\nu(z_{<n})/\xi(z_{<n}) \ge \alpha_s$ with $\alpha_s \to 0$ is
used to exclude such environments from consideration.
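
The behaviour of the ratio $\nu(z_{<n})/\xi(z_{<n})$ can be illustrated numerically. The Python sketch below is a simplification and not the paper's construction: it assumes a finite class of passive environments that emit i.i.d. binary rewards with parameters `thetas`, a uniform-weight mixture, and a fixed reward sequence whose frequency of ones is 0.8. All names are hypothetical. Likelihoods are handled in log space to avoid numerical underflow.

```python
import math


def log_likelihood(theta, rewards):
    """Log-probability of a binary reward sequence under i.i.d. Bernoulli(theta)."""
    return sum(math.log(theta if r == 1 else 1.0 - theta) for r in rewards)


def likelihood_ratios(thetas, weights, rewards):
    """Return nu_i(z) / xi(z) for each environment, with xi = sum_i w_i * nu_i."""
    logs = [log_likelihood(theta, rewards) for theta in thetas]
    peak = max(logs)  # rescale by the largest likelihood for numerical stability
    scaled = [math.exp(value - peak) for value in logs]
    mixture = sum(w * s for w, s in zip(weights, scaled))
    return [s / mixture for s in scaled]


thetas = [0.8, 0.5, 0.2]          # hypothetical class of three environments
weights = [1 / 3, 1 / 3, 1 / 3]   # mixture weights
rewards = [1, 1, 1, 1, 0] * 200   # sequence typical for theta = 0.8
ratios = likelihood_ratios(thetas, weights, rewards)
print(ratios)
```

In this toy setting the ratio of the environment matching the data stays bounded away from zero (it approaches the inverse of its weight), while the ratios of the other two environments collapse towards zero, so a threshold test with a slowly decreasing $\alpha_s$ eventually discards them.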

Upper self-optimizingness. Next we consider the task in which our goal is more 
moderate. Rather than trying to find a policy which will obtain the same average 
limiting value as an optimal one for any environment in a certain class, we will try 
to obtain only the optimum upper limiting average. That is, we will try to find 
a policy which infinitely often gets as close as desirable to the maximum possible 
average value. It turns out that in this case a much simpler condition is sufficient: 
recoverability instead of value-stability. 

Theorem 4 (recoverable $\Rightarrow$ upper self-optimizing). For any countable class $\mathcal{C}$ of
recoverable environments, there exists a policy which is upper self-optimizing for $\mathcal{C}$.

A formal proof can be found in Section A; its idea is as follows. The upper
self-optimizing policy $p$ to be constructed will loop through all environments in
$\mathcal{C}$ in such a way that each environment is tried infinitely often, and for each
environment the agent will try to get $\varepsilon$-close (with decreasing $\varepsilon$) to the
upper-limiting average value, until it either manages to do so, or a special stopping
condition holds: $\nu(z_{<n})/\xi(z_{<n}) < \alpha_s$, where $\alpha_s$ is decreasing accordingly.
This condition necessarily breaks if the upper limiting average value cannot be achieved.
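
The proof idea requires an order of visiting a countable class in which every environment is tried infinitely often. One possible schedule, given here only as an assumption for illustration (the text does not fix a particular order), visits the indices 0; 0, 1; 0, 1, 2; and so on. The generator name `revisit_schedule` is hypothetical.

```python
from itertools import count, islice


def revisit_schedule():
    """Yield environment indices so that every index appears infinitely often."""
    for stage in count(1):
        # Stage s revisits the first s environments of the enumeration.
        for index in range(stage):
            yield index


print(list(islice(revisit_schedule(), 10)))
```

Any other enumeration with the same property would serve equally well; this one has the convenience that environment number $i$ is first tried after finitely many steps and then again in every later stage.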

5 Non-recoverable environments 

Before proceeding with examples of value-stable environments, we briefly discuss 
what can be achieved if an environment does not forgive initial wrong actions, that 
is, is not recoverable. It turns out that value-stability can be defined for non- 
recoverable environments as well, and optimal — in a worst-case sense — policies 
can be identified.

For an environment $\nu$, a policy $p$ and a history $z_{<k}$ such that
$\nu(x_{<k}|y_{<k}) > 0$, if there exists a constant $\overline{V}$ or a constant $\underline{V}$
such that

$$P(\overline{V}(\nu,p) = \overline{V} \mid z_{<k}) = 1, \quad\text{or}\quad P(\underline{V}(\nu,p) = \underline{V} \mid z_{<k}) = 1,$$

then we say that the upper conditional (on $z_{<k}$) limiting average or (respectively)
lower conditional average value exists, and denote it by
$\overline{V}(\nu,p,z_{<k}) := \overline{V}$ (or $\underline{V}(\nu,p,z_{<k}) := \underline{V}$). If both
upper and lower conditional average limiting values exist and are equal then we say that
the average conditional value exists and denote it by
$V(\nu,p,z_{<k}) := \overline{V}(\nu,p,z_{<k}) = \underline{V}(\nu,p,z_{<k})$.

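Call an environment $\nu$ strongly (upper) explorable if for any history $z_{<k}$ such that
$\nu(x_{<k}|y_{<k}) > 0$ there exists a policy $p^{z_{<k}}_\nu$ such that
$V(\nu,p^{z_{<k}}_\nu,z_{<k})$ (respectively $\overline{V}(\nu,p^{z_{<k}}_\nu,z_{<k})$) exists and
$\overline{V}(\nu,p,z_{<k}) \le V(\nu,p^{z_{<k}}_\nu,z_{<k})$ (respectively
$\overline{V}(\nu,p,z_{<k}) \le \overline{V}(\nu,p^{z_{<k}}_\nu,z_{<k})$) with probability
1 for every policy $p$. In this case define $V^*_\nu(z_{<k}) := V(\nu,p^{z_{<k}}_\nu,z_{<k})$
(respectively $\overline{V}^*_\nu(z_{<k}) := \overline{V}(\nu,p^{z_{<k}}_\nu,z_{<k})$).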

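For a strongly explorable environment $\nu$ define the worst-case optimal value

$$W^*_\nu := \inf_{k,\,z_{<k}:\,\nu(x_{<k}|y_{<k})>0} V^*_\nu(z_{<k}),$$

and for a strongly upper explorable $\nu$ define the worst-case upper optimal value

$$\overline{W}^*_\nu := \inf_{k,\,z_{<k}:\,\nu(x_{<k}|y_{<k})>0} \overline{V}^*_\nu(z_{<k}).$$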

In words, the worst-case optimal value is the asymptotic average reward which is 
attainable with certainty after any finite sequence of actions has been taken. 

Note that a recoverable explorable environment is also strongly explorable. 

A policy $p$ will be called worst-case self-optimizing or worst-case upper
self-optimizing for a class of environments $\mathcal{C}$ if
$\liminf_m \frac{1}{m} r^{p\nu}_{1..m} \ge W^*_\nu$, or (respectively)
$\limsup_m \frac{1}{m} r^{p\nu}_{1..m} \ge \overline{W}^*_\nu$ with probability 1 for every
$\nu \in \mathcal{C}$.

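Definition 5 (worst-case value-stable environments). A strongly explorable environment
$\nu$ is worst-case value-stable if there exist a sequence of numbers
$r^\nu_i \in [0,r_{\max}]$ and two functions $d_\nu(k,\varepsilon)$ and $\varphi_\nu(n,\varepsilon)$ such that
$\frac{1}{n} r^\nu_{1..n} \to W^*_\nu$, $d_\nu(k,\varepsilon) = o(k)$,
$\sum_{n=1}^{\infty} \varphi_\nu(n,\varepsilon) < \infty$ for every fixed $\varepsilon$, and for every $k$
and every history $z_{<k}$ there exists a policy $p = p^{z_{<k}}_\nu$ such that

$$P\left( r^\nu_{k..k+n} - r^{p\nu}_{k..k+n} > d_\nu(k,\varepsilon) + n\varepsilon \mid z_{<k} \right) \le \varphi_\nu(n,\varepsilon). \qquad (2)$$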

Note that a recoverable environment is value-stable if and only if it is worst-case 
value-stable. 

Worst-case value stability helps to distinguish between irreversible actions (or 
"traps") and actions which result only in a temporary loss in performance; moreover, 
worst-case value-stability means that a temporary loss in performance can only be
short (sublinear). 

Finally, we can establish the following result (cf. Theorems 3 and 4).

Theorem 6 (worst-case self-optimizing). (i) For any countable set of worst-case
value-stable environments $\mathcal{C}$ there exists a policy $p$ which is worst-case
self-optimizing for $\mathcal{C}$.

(ii) For any countable set of strongly upper explorable environments $\mathcal{C}$ there exists
a policy $p$ which is worst-case upper self-optimizing for $\mathcal{C}$.

The proof of this theorem is analogous to the proofs of Theorems 3 and 4; the
differences are explained in Section A.

6 Examples 

In this section we illustrate the results of the previous section with examples of
classes of value-stable environments. These are also examples of recoverable
environments, since recoverability is strictly weaker than value-stability. At the end of
the section we also give some simple examples of recoverable but not value-stable
environments.

We first note that passive environments are value-stable. An environment is
called passive if the observations and rewards do not depend on the actions of
the agent. The sequence prediction task provides a well-studied (and perhaps the only
reasonable) class of passive environments: in this task the agent is required to give
the probability distribution of the next observation given the previous observations.
The true distribution of observations depends only on the previous observations
(and does not depend on actions and rewards). Since we have confined ourselves to
considering finite action spaces, the agent is required to give ranges of probabilities
for the next observation, where the ranges are fixed beforehand. The reward 1 is
given if all the ranges are correct and the reward 0 is given otherwise. It is easy to
check that any such environment is value-stable with $r^\nu_i = 1$, $d(k,\varepsilon) = 1$,
$\varphi(n,\varepsilon) = 0$, since, knowing the distribution, one can always start giving the
correct probability ranges (this defines the policy $p_\nu$).

Obviously, there are active value-stable environments too. The next proposition
provides some conditions on mixing rates which are sufficient for value-stability; we 
do not intend to provide sharp conditions on mixing rates but rather to illustrate 
the relation of value-stability with mixing conditions. 

We say that a stochastic process $h_k$, $k \in \mathbb{N}$, satisfies strong $\alpha$-mixing conditions
with coefficients $\alpha(k)$ if (see e.g. [Bos96])

$$\sup_{n \in \mathbb{N}} \; \sup_{B \in \sigma(h_1,\ldots,h_n),\; C \in \sigma(h_{n+k},\ldots)} |P(B \cap C) - P(B)P(C)| \le \alpha(k),$$

where $\sigma()$ stands for the sigma-algebra generated by the random variables in
brackets. Loosely speaking, mixing coefficients $\alpha$ reflect the speed with which the process
"forgets" about its past.

Proposition 7 (mixing and value-stability). Suppose that an explorable environment
$\nu$ is such that there exist a sequence of numbers $r^\nu_i$ and a function $d(k)$ such
that $\frac{1}{n} r^\nu_{1..n} \to V^*_\nu$, $d(k) = o(k)$, and for each $z_{<k}$ there exists a policy
$p$ such that the sequence $r^{p\nu}_i$ satisfies strong $\alpha$-mixing conditions with
coefficients $\alpha(k) =$ for some $\varepsilon > 0$ and

$$r^\nu_{k..k+n} - E\left( r^{p\nu}_{k..k+n} \mid z_{<k} \right) \le d(k)$$

for any $n$. Then $\nu$ is value-stable.

Proof. Using the union bound we obtain

$$P\left( r^\nu_{k..k+n} - r^{p\nu}_{k..k+n} > d(k) + n\varepsilon \right) \le I\left( r^\nu_{k..k+n} - E\, r^{p\nu}_{k..k+n} > d(k) \right) + P\left( \left| r^{p\nu}_{k..k+n} - E\, r^{p\nu}_{k..k+n} \right| > n\varepsilon \right).$$

The first term equals 0 by assumption and the second term for each $\varepsilon$ can be shown
to be summable using [Bos96, Thm. 1.3]: for a sequence of uniformly bounded
zero-mean random variables $r_i$ satisfying strong $\alpha$-mixing conditions the following bound
holds true for any integer $q \in [1,n/2]$

P (l r i..n| > ne) < ce~ £2q / c + cqa 

for some constant c; in our case we just set q = n^ . □ 

(PO)MDPs. Applicability of Theorem 3 and Proposition 7 can be illustrated
on (PO)MDPs. We note that self-optimizing policies for (uncountable) classes of
finite ergodic MDPs and POMDPs are known [BT99, EDKM05]; the aim of the
present section is to show that value-stability is a weaker requirement than the
requirements of these models, and also to illustrate applicability of our results. We
call $\mu$ a (stationary) Markov decision process (MDP) if the probability of perceiving
$x_k \in \mathcal{X}$, given history $z_{<k} y_k$, only depends on $y_k \in \mathcal{Y}$ and $x_{k-1}$. In this
case $x_k \in \mathcal{X}$ is called a state, $\mathcal{X}$ the state space. An MDP $\mu$ is called ergodic
if there exists a policy under which every state is visited infinitely often with
probability 1. An MDP with a stationary policy forms a Markov chain.

An environment is called a (finite) partially observable MDP (POMDP) if there
is a sequence of random variables $s_k$ taking values in a finite space $\mathcal{S}$ called the
state space, such that $x_k$ depends only on $s_k$ and $y_k$, and $s_{k+1}$ is independent of
$s_{<k}$ given $s_k$. Abusing notation, the sequence $s_{1:k}$ is called the underlying Markov
chain. A POMDP is called ergodic if there exists a policy such that the underlying Markov
chain visits each state infinitely often with probability 1.

In particular, any ergodic POMDP $\nu$ satisfies strong $\alpha$-mixing conditions with
coefficients decaying exponentially fast in case there is a set $H \subset \mathcal{R}$ such that
$\nu(r_i \in H) = 1$ and $\nu(r_i = r \mid s_i = s, y_i = y) \ne 0$ for each
$y \in \mathcal{Y}$, $s \in \mathcal{S}$, $r \in H$, $i \in \mathbb{N}$. Thus for any
such POMDP $\nu$ we can use Proposition 7 with $d(k,\varepsilon)$ a constant function to show
that $\nu$ is value-stable:

Corollary 8 (POMDP $\Rightarrow$ value-stable). Suppose that a POMDP $\nu$ is ergodic and
there exists a set $H \subset \mathcal{R}$ such that $\nu(r_i \in H) = 1$ and
$\nu(r_i = r \mid s_i = s, y_i = y) \ne 0$ for each $y \in \mathcal{Y}$, $s \in \mathcal{S}$, $r \in H$,
where $\mathcal{S}$ is the finite state space of the underlying Markov chain.
Then $\nu$ is value-stable.

However, it is illustrative to obtain this result for MDPs directly, and in a slightly 
stronger form. 

Proposition 9 (MDP $\Rightarrow$ value-stable). Any finite-state ergodic MDP $\nu$ is a
value-stable environment.

Proof. Let $d(k,\varepsilon) = 0$. Denote by $\mu$ the true environment, let $z_{<k}$ be the current
history and let the current state (the observation $x_k$) of the environment be
$a \in \mathcal{X}$, where $\mathcal{X}$ is the set of all possible states. Observe that for an MDP
there is an optimal policy which depends only on the current state. Moreover, such a
policy is optimal for any history. Let $p_\mu$ be such a policy. Let $r^\mu_i$ be the expected
reward of $p_\mu$ on step $i$. Let $l(a,b) = \min\{n : x_{k+n} = b \mid x_k = a\}$. By ergodicity
of $\mu$ there exists a policy $p$ for which $E\,l(b,a)$ is finite (and does not depend on
$k$). A policy $p$ needs to get from the state $b$ to one of the states visited by an
optimal policy, and then acts
according to p M . Let f{n)'- =I j^- We have 

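$$
\begin{aligned}
P\left( \left| r^\mu_{k..k+n} - r^{p\mu}_{k..k+n} \right| > n\varepsilon \right)
&\le \sup_{a \in \mathcal{X}} P\left( \left| E\left( r^{p_\mu\mu}_{k..k+n} \mid x_k = a \right) - r^{p\mu}_{k..k+n} \right| > n\varepsilon \right) \\
&\le \sup_{a,b \in \mathcal{X}} P\left( l(a,b) > f(n)/r_{\max} \right) \\
&\quad + \sup_{a \in \mathcal{X}} P\left( \left| E\left( r^{p_\mu\mu}_{k..k+n} \mid x_k = a \right) - r^{p_\mu\mu}_{k+f(n)..k+n} \right| > n\varepsilon - f(n) \;\middle|\; x_{k+f(n)} = a \right) \\
&\le \sup_{a,b \in \mathcal{X}} P\left( l(a,b) > f(n)/r_{\max} \right) \\
&\quad + \sup_{a \in \mathcal{X}} P\left( \left| E\left( r^{p_\mu\mu}_{k..k+n} \mid x_k = a \right) - r^{p_\mu\mu}_{k..k+n} \right| > n\varepsilon - 2f(n) \;\middle|\; x_k = a \right).
\end{aligned}
$$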



In the last term we have the deviation of the reward attained by the optimal policy 
from its expectation. Clearly, both terms are bounded exponentially in n. □ 

In the examples above the function $d(k,\varepsilon)$ is a constant and $\varphi(n,\varepsilon)$ decays
exponentially fast. This suggests that the class of value-stable environments stretches
beyond finite (PO)MDPs. We illustrate this guess by the construction that follows.

A general scheme for constructing value-stable or recoverable
environments: infinitely armed bandit. Next we present a construction of environments
which cannot be modelled as finite POMDPs but are value-stable and/or
recoverable. Consider the following environment $\nu$. There is a countable family
$C = \{\zeta_i : i \in \mathbb{N}\}$ of arms, that is, sources generating i.i.d. rewards 0 and 1 (and,
say, empty observations) with some probability $\delta_i$ of the reward being 1. The action
space $\mathcal{Y}$ consists of three actions $\mathcal{Y} = \{g,u,d\}$. To get the next reward from the
current arm $\zeta_i$ an agent can use the action $g$. Let $i$ denote the index of the current arm.
At the beginning $i = 0$, the current arm is $\zeta_0$, and then the agent can move between
arms as follows: it can move $U(i)$ arms "up" using the action $u$ (i.e. $i := i + U(i)$)
or it can move $D(i)$ arms "down" using the action $d$ (i.e. $i := i - D(i)$, or 0 if the
result is negative). The reward for actions $u$ and $d$ is 0. In all the examples below
$U(i) = 1$, that is, the action $u$ takes the agent one arm up.

Clearly, $\nu$ is a POMDP with a countably infinite number of states in the underlying
Markov chain, which (in general) is not isomorphic to a finite POMDP.
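
The construction can be written down directly as a small simulator. The Python sketch below is an illustration only: the class name `ArmBandit`, the callables `delta` (for $\delta_i$) and `down` (for $D(i)$), and the seeded random generator are assumptions introduced here, with $U(i) = 1$ as in the text.

```python
import random


class ArmBandit:
    """Infinitely armed bandit with actions g (pull), u (one arm up), d (D(i) arms down)."""

    def __init__(self, delta, down, seed=0):
        self.delta = delta              # delta(i): probability that arm i pays reward 1
        self.down = down                # down(i): D(i), arms moved down by action d
        self.arm = 0                    # the agent starts at arm 0
        self.rng = random.Random(seed)  # seeded for reproducibility

    def step(self, action):
        """Perform one action and return the reward (always 0 for u and d)."""
        if action == "g":
            return 1 if self.rng.random() < self.delta(self.arm) else 0
        if action == "u":
            self.arm += 1
        elif action == "d":
            self.arm = max(0, self.arm - self.down(self.arm))
        else:
            raise ValueError(f"unknown action: {action!r}")
        return 0


# Hypothetical instance: only arm 3 ever pays, and d jumps back to arm 0 (D(i) = i).
env = ArmBandit(delta=lambda i: 1.0 if i == 3 else 0.0, down=lambda i: i)
rewards = [env.step(a) for a in ["u", "u", "u", "g", "d", "g"]]
print(rewards, env.arm)
```

The index of the current arm is the only hidden state, which is why the environment is a POMDP with countably many states.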

Claim 10. If $D(i) = i$ for all $i \in \mathbb{N}$ then the environment $\nu$ just constructed is
value-stable. If $D(i) = 1$ then $\nu$ is recoverable but not necessarily value-stable; that
is, there are choices of the probabilities $\delta_i$ such that $\nu$ is not value-stable.

Proof. First we show that in either case ($D(i) = i$ or $D(i) = 1$) $\nu$ is explorable. Let
$\delta = \sup_{i \in \mathbb{N}} \delta_i$. Clearly, $\overline{V}(\nu,p') \le \delta$ with probability 1 for any
policy $p'$. A policy $p$ which, knowing all the probabilities $\delta_i$, achieves
$\overline{V}(\nu,p) = \underline{V}(\nu,p) = \delta =: V^*_\nu$ a.s., can be
easily constructed. Indeed, find a sequence $\zeta'_j$, $j \in \mathbb{N}$, where for each $j$ there
is $i =: i_j$ such that $\zeta'_j = \zeta_{i_j}$, satisfying $\lim_{j\to\infty} \delta_{i_j} = \delta$.
The policy $p$ should carefully exploit the arms $\zeta'_j$ one by one, staying with each arm
long enough to ensure that the average reward is close to the expected reward up to an
error probability $\varepsilon_j$, where $\varepsilon_j$ quickly tends to 0, and so that switching
between arms has a negligible impact on the average reward. Thus $\nu$ can be shown to be
explorable. Moreover, a policy $p$ just sketched can be made independent of (observations
and) rewards.

Next we show that if $D(i) = i$, that is, the action $d$ always takes the agent down to the
first arm, then the environment is value-stable. Indeed, one can modify the policy
$p$ (possibly allowing it to exploit each arm longer) so that on each time step $t$ (from
some $t$ on) we have $j(t) < \sqrt{t}$, where $j(t)$ is the number of the current arm on step
$t$. Thus, after any actions-perceptions history one needs about $\sqrt{k}$ actions (one
action $d$ and enough actions $u$) to catch up with $p$. So, (1) can be shown to hold
with $d(k,\varepsilon) = \sqrt{k}$, $r^\nu_i$ the expected reward of $p$ on step $i$ (since $p$ is
independent of rewards, $r^{p\nu}_i$ are independent), and the rates $\varphi(n,\varepsilon)$
exponential in $n$.

To construct a non-value-stable environment with $D(i) = 1$, simply set $\delta_0 = 1$ and
$\delta_j = 0$ for $j > 0$; then after taking $n$ actions u one can only return to optimal rewards
with $n$ actions d, that is, $d(k) = o(k)$ cannot be obtained. Still it is easy to check
that recoverability is preserved, whatever the choice of $\delta_i$. □
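
The difference between the two choices of $D$ can be made concrete by counting the moves needed to reach a target arm. The helper below is a minimal sketch with a hypothetical name (`moves_to_reach`); it assumes $U(i) = 1$, as in all the examples above.

```python
def moves_to_reach(i, target, reset_to_first):
    """Fewest u/d moves from arm i to arm target when U(i) = 1.

    reset_to_first=True models D(i) = i (action d jumps to arm 0);
    reset_to_first=False models D(i) = 1 (action d moves one arm down).
    """
    if target >= i:
        return target - i        # only u moves are needed
    if reset_to_first:
        return 1 + target        # one d, then `target` moves u
    return i - target            # one d per arm


# Returning from arm n to arm 0 costs a single move when D(i) = i,
# but n moves when D(i) = 1, which is linear in the length of the history.
cost_reset = moves_to_reach(1000, 0, reset_to_first=True)
cost_step = moves_to_reach(1000, 0, reset_to_first=False)
```

Here `cost_reset` is 1 and `cost_step` is 1000, matching the observation that with $\delta_0 = 1$ and $D(i) = 1$ the recovery cost cannot be $o(k)$.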

In the above construction we can also allow the action d to bring the agent
$d(i) < i$ steps down, where $i$ is the number of the current environment $\zeta$, according
to some (possibly randomized) function $d(i)$, thus changing the function $d_\nu(k,\varepsilon)$ and
possibly making it non-constant in $\varepsilon$ and as close as desirable to linear.

7 Necessity of value-stability

Now we turn to the question of how tight the conditions of value-stability are. The
following proposition shows that the requirement $d(k,\varepsilon) = o(k)$ in (1) cannot be
relaxed.



Proposition 11 (necessity of $d(k,\varepsilon) = o(k)$). There exists a countable family of
deterministic explorable environments $C$ such that

• for any $\nu \in C$ and any sequence of actions $y_{<k}$ there exists a policy $p$ such that

$r^{\nu}_{k..k+n} \le r^{p\nu}_{k..k+n} + k$ for all $n > k$,

where $r^\nu_i$ are the rewards attained by an optimal policy $p_\nu$ (which from the
beginning was acting optimally), but

• for any policy $p$ there exists an environment $\nu \in C$ such that $\underline{V}(\nu,p) < V^*_\nu$ (i.e.
there is no self-optimizing policy for $C$).

Clearly, each environment from such a class $C$ satisfies the value-stability conditions
with $\varphi(n,\varepsilon) = 0$ except $d(k,\varepsilon) = k \ne o(k)$.

Proof. There are two possible actions $y_i \in \{a,b\}$, three possible rewards $r_i \in \{0,1,2\}$
and no observations.

Construct the environment $\nu_0$ as follows: if $y_i = a$ then $r_i = 1$ and if $y_i = b$ then
$r_i = 0$, for any $i \in \mathbb{N}$.

For each $i$ let $n_i$ denote the number of actions a taken up to step $i$: $n_i := \#\{j \le
i : y_j = a\}$. For each $s > 0$ construct the environment $\nu_s$ as follows: $r_i(a) = 1$ for any
$i$; $r_i(b) = 2$ if the longest consecutive sequence of actions b taken has length greater
than $n_i$ and $n_i \ge s$; otherwise $r_i(b) = 0$.

It is easy to see that each $\nu_i$, $i > 0$, satisfies the value-stability conditions with
$\varphi(n,\varepsilon) = 0$ except $d(k,\varepsilon) = k \ne o(k)$, and does not satisfy them with any $d(k,\varepsilon) = o(k)$.
Next we show that there is no self-optimizing policy for the class.

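Suppose that there exists a policy $p$ such that $\underline{V}(\nu_i,p) = V^*_{\nu_i}$ for each $i > 0$ and let
the true environment be $\nu_0$. By assumption, for each $s$ there exists $n$ such that

$\#\{i < n : y_i = b, r_i = 0\} > s > \#\{i < n : y_i = a, r_i = 1\}$,

so that at such a step fewer than half of the rewards received are non-zero. Since $n \to \infty$ as $s \to \infty$,
this implies $\underline{V}(\nu_0,p) \le 1/2 < 1 = V^*_{\nu_0}$. □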
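
The counting step at the end of this proof can be checked numerically. The sketch below uses only the environment $\nu_0$ (action a pays 1, action b pays 0); the policy `probing_actions` is a hypothetical example of a policy that keeps trying runs of b longer than the number of a's taken so far, and all function names are illustrative.

```python
def nu0_rewards(actions):
    """Rewards in nu_0: action a pays 1, action b pays 0."""
    return [1 if y == "a" else 0 for y in actions]


def probing_actions(phases):
    """Hypothetical policy: in each phase take some a's, then take b's
    until the b's outnumber all a's taken so far."""
    actions, checkpoints = [], []
    n_a = n_b = 0
    for j in range(phases):
        actions += ["a"] * (2 ** j)
        n_a += 2 ** j
        extra = n_a + 1 - n_b        # b's needed so that n_b > n_a
        actions += ["b"] * extra
        n_b += extra
        checkpoints.append(len(actions))
    return actions, checkpoints


def averages_at(rewards, checkpoints):
    """Average reward over the first n steps, for each checkpoint n."""
    return [sum(rewards[:n]) / n for n in checkpoints]


actions, checkpoints = probing_actions(10)
avgs = averages_at(nu0_rewards(actions), checkpoints)
```

Every entry of `avgs` lies below 1/2, whereas the policy that always takes action a has average reward 1 in $\nu_0$.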

It is also easy to show that the uniformity of convergence in (1) cannot be
dropped. That is, if in the definition of value-stability we allow the function $\varphi(n,\varepsilon)$
to depend additionally on the past history $z_{<k}$, then Theorem 3 does not hold. This
can be shown with the same example as constructed in the proof of Proposition 11,
letting $d(k,\varepsilon) = 0$ but instead allowing $\varphi(n,\varepsilon,z_{<k})$ to take values 0 and 1 according
to the number of actions a taken, achieving the same behaviour as in the example
provided in the last proof.

Moreover, we show that the requirement that the class $C$ to be learnt is countable
cannot be easily withdrawn. Indeed, consider the class of all deterministic passive
environments in the sequence prediction setting. In this task an agent gets the
reward 1 if $y_i = o_{i+1}$ and the reward 0 otherwise, where the sequence of observations
$o_i$ is deterministic. Different sequences correspond to different environments. As
was mentioned before, any such environment $\nu$ is value-stable with $d_\nu(k,\varepsilon) = 1$,
$\varphi_\nu(n,\varepsilon) = 0$ and $r^\nu_i = 1$. Obviously, the class of all deterministic passive environments
is not countable. Since for every policy $p$ there is an environment on which $p$ errs
on every step, the class of all deterministic passive environments cannot be
learned. Therefore, the following statement is valid:

Claim 12. There exist (uncountable) classes of value-stable environments for which 
there are no self- optimizing policies. 
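
The diagonal argument behind this claim is easy to make explicit. The sketch below assumes binary observations and a deterministic policy given as a function of the past observations; the names `adversarial_sequence` and `total_reward` are hypothetical.

```python
def adversarial_sequence(policy, length):
    """Build a binary observation sequence on which `policy` errs at every step.

    `policy` maps the tuple of past observations to a prediction in {0, 1}.
    """
    observations = []
    for _ in range(length):
        guess = policy(tuple(observations))
        observations.append(1 - guess)   # the next observation contradicts the guess
    return observations


def total_reward(policy, observations):
    """Reward 1 whenever the prediction equals the next observation."""
    return sum(
        1
        for i, o in enumerate(observations)
        if policy(tuple(observations[:i])) == o
    )


def majority(past):
    """Example policy: predict the symbol seen most often so far."""
    return 1 if 2 * sum(past) > len(past) else 0


seq = adversarial_sequence(majority, 100)
reward = total_reward(majority, seq)
```

For any deterministic policy the constructed sequence is itself a deterministic passive environment, so `reward` is 0 while a policy that knew the sequence would collect reward 1 on every step.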

However, strictly speaking, even for countable classes value-stability is not
necessary for self-optimizingness. This can be demonstrated on the class $\{\nu_i : i > 0\}$ from
the proof of Proposition 11. (Whereas if we add $\nu_0$ to the class, a self-optimizing
policy no longer exists.) So we have the following:

Claim 13. There are countable classes of not value-stable environments for which 
self- optimizing policies exist. 

8 Discussion 

Summary. We have proposed a set of conditions on environments, called value- 
stability, such that any countable class of value-stable environments admits a self- 
optimizing policy. It was also shown that these conditions are in a certain sense 
tight. The class of all value-stable environments includes ergodic MDPs, certain
classes of finite POMDPs, passive environments, and (provably) more environments.
So the concept of value-stability allows us to characterize self-optimizing environment
classes, and proving value-stability is typically much easier than proving
self-optimizingness directly. Value-stability means that from any (suboptimal) sequence
of actions it is possible to recover fast. If it is possible to recover, but not necessarily
fast, then we get a condition which we called recoverability, which was shown to be 
sufficient to be able to recover the upper limit of the optimal average asymptotic 
value. We have also analyzed what can be achieved in environments which possess 
(worst-case) value-stability but are not recoverable; it turned out that a certain 
worst-case self-optimizingness can be identified in this case too. 

In the following picture we summarize the concepts introduced in the preceding
sections. The arrows symbolize implications: some of them follow from theorems or
are stated in definitions (marked accordingly), while others are trivial.



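[Figure: diagram of implications between the properties explorable, upper explorable, strongly explorable, strongly upper explorable, value-stable, recoverable, worst-case value-stable, self-optimizing, upper self-optimizing, worst-case self-optimizing and worst-case upper self-optimizing. Arrows are labelled with the definitions and theorems (e.g. Theorem 3) from which the implications follow.]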



Outlook. We considered only countable environment classes C. From a computa- 
tional perspective such classes are sufficiently large (e.g. the class of all computable 
probability measures is countable). On the other hand, countability excludes contin- 
uously parameterized families (like all ergodic MDPs), common in statistical prac- 
tice. So perhaps the main open problem is to find under which conditions the 
requirement of countability of the class can be lifted. Another important question 
is whether (meaningful) necessary and sufficient conditions for self-optimizingness
can be found. However, identifying classes of environments for which self-optimizing
policies exist is a hard problem which has not been solved even for passive
environments [RH06].

One more question concerns the uniformity of forgetfulness of the environment.
Currently in the definition of value-stability (1) we have the function $\varphi(n,\varepsilon)$ which
is the same for all histories $z_{<k}$, that is, both for all action histories $y_{<k}$ and
observation-reward histories $x_{<k}$. Probably it is possible to differentiate between
two types of forgetfulness, one for actions and one for perceptions.

In this work we have chosen the asymptotic uniform average value $\lim \frac{1}{m} r_{1..m}$ as
our performance measure. Another popular measure is the asymptotic discounted
value $\gamma_1 r_1 + \gamma_2 r_2 + \dots$, where $\gamma$ is some (typically geometric, $\gamma_k \propto \gamma^k$) discount
sequence. One can show [Hut06] under quite general conditions that the limits of average
and future discounted values coincide. Equivalence holds for bounded rewards
and monotone decreasing $\gamma$, in deterministic environments and, in expectation over
the history, also for probabilistic environments. So, in these cases our results also
apply to discounted value.
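
As a small numerical illustration of this agreement (not a proof), one can compare the uniform average with a geometrically discounted value on a bounded, deterministic reward sequence. The sketch assumes that the discounted value is normalised by the sum of the discounts and that the discount factor is close to 1; both function names are hypothetical.

```python
def average_value(rewards):
    """Uniform average of a finite reward prefix."""
    return sum(rewards) / len(rewards)


def discounted_value(rewards, gamma):
    """Geometrically discounted value, normalised by the sum of the discounts."""
    weights = [gamma ** k for k in range(1, len(rewards) + 1)]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


rewards = [1, 0] * 5000          # deterministic, bounded, periodic rewards
avg = average_value(rewards)
disc = discounted_value(rewards, gamma=0.999)
```

For this periodic sequence the average is exactly 1/2, and the normalised discounted value equals $1/(1+\gamma)$, which approaches 1/2 as $\gamma \to 1$.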

Finally, it should be mentioned that we have concentrated on optimal values
which can be obtained with certainty (with probability one); towards this aim we
have defined (upper, strong) explorability and only considered environments which
possess one of these properties. It would also be interesting to analyze what is
achievable in environments which are not (upper, strongly) explorable; for example,
one could consider optimal expected value, and maybe some other criteria.



A Proofs of Theorems 3 and 4



In each of the proofs, a self-optimizing (or upper self-optimizing) policy $p$ will be
constructed. When the policy $p$ has been defined up to a step $k$, an environment $\mu$,
endowed with this policy, can be considered as a measure on $Z^k$. We assume this
meaning when we use environments as measures on $Z^k$ (e.g. $\nu(z_{<i})$).

Proof of Theorem 3. A self-optimizing policy $p$ will be constructed as follows. On
each step we will have two policies: $p^t$ which exploits and $p^e$ which explores; for each
$i$ the policy $p$ either takes an action according to $p^t$ ($p(z_{<i}) = p^t(z_{<i})$) or according
to $p^e$ ($p(z_{<i}) = p^e(z_{<i})$), as will be specified below.

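In the algorithm below, $i$ denotes the number of the current step in the sequence
of actions-observations. Let $n = 1$, $s = 1$, and $j_t = j_e = 0$. Let also $\alpha_s = 2^{-s}$ for $s \in \mathbb{N}$.
For each environment $\nu$, find a sequence of real numbers $\varepsilon^\nu_n$ such that $\varepsilon^\nu_n \to 0$ and

$\sum_{n=1}^{\infty} \varphi_\nu(n, \varepsilon^\nu_n) < \infty. \quad (2)$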

Let $\iota : \mathbb{N} \to C$ be a numbering such that each $\nu \in C$ has infinitely many indices.
For all $i > 1$ define a measure $\xi$ as follows:

$\xi(z_{<i}) = \sum_{\nu \in C} w_\nu \, \nu(z_{<i}), \quad (3)$

where $w_\nu \in \mathbb{R}$ are (any) numbers such that $\sum_\nu w_\nu = 1$ and $w_\nu > 0$ for all $\nu \in C$.
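
A useful consequence of (3) is that the mixture dominates every environment in the class: for each environment the likelihood ratio against the mixture is at most one over its weight. The sketch below checks this for one fixed history, assuming a finite class as a stand-in for the countable one; the function names and the numbers are illustrative only.

```python
def mixture(weights, likelihoods):
    """xi(z) = sum over nu of w_nu * nu(z), for one fixed history z."""
    return sum(w * p for w, p in zip(weights, likelihoods))


def ratios(weights, likelihoods):
    """Likelihood ratios nu(z) / xi(z) for every environment nu."""
    xi = mixture(weights, likelihoods)
    return [p / xi for p in likelihoods]


weights = [0.5, 0.25, 0.25]        # positive weights summing to 1
likelihoods = [0.1, 0.4, 0.0]      # nu(z) for three hypothetical environments
r = ratios(weights, likelihoods)
dominated = all(x <= 1 / w + 1e-12 for x, w in zip(r, weights))
```

An environment that assigns probability 0 to the observed history gets ratio 0, while no ratio can exceed the reciprocal of the corresponding weight.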
Define $T$. On each step $i$ let

$T \equiv T_i := \left\{ \nu \in C : \frac{\nu(z_{<i})}{\xi(z_{<i})} \ge \alpha_s \right\}.$

Define $\nu^t$. Set $\nu^t$ to be the first environment in $T$ with index greater than $\iota(j_t)$.
In case this is impossible (that is, if $T$ is empty), increment $s$, (re)define $T$ and try
again. Increment $j_t$.

Define $\nu^e$. Set $\nu^e$ to be the first environment with index greater than $\iota(j_e)$ such
that $V^*_{\nu^e} > V^*_{\nu^t}$ and $\nu^e(z_{<k}) > 0$, if such an environment exists. Otherwise proceed one
step (according to $p^t$) and try again. Increment $j_e$.

Consistency. On each step $i$ (re)define $T$. If $\nu^t \notin T$, define $\nu^t$, increment $s$ and
iterate the infinite loop. (Thus $s$ is incremented only if $\nu^t$ is not in $T$ or if $T$ is
empty.)

Start the infinite loop. Increment n. 

Let $\delta := (V^*_{\nu^e} - V^*_{\nu^t})/2$. Let $\varepsilon := \varepsilon^{\nu^t}_n$. If $\varepsilon < \delta$ set $\delta = \varepsilon$. Let $h = j_e$.
Prepare for exploration.

Increment $h$. The index $h$ is incremented with each next attempt of exploring
$\nu^e$. Each attempt will be at least $h$ steps in length.
Let $p^t = p^{y_{<i}}_{\nu^t}$ and set $p = p^t$.

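Let $i_h$ be the current step. Find $k_1$ such that

$\frac{i_h}{k_1} V^*_{\nu^t} < \varepsilon/8. \quad (4)$

Find $k_2 > 2 i_h$ such that for all $m > k_2$

$\left| \frac{1}{m - i_h} r^{\nu^t}_{i_h..m} - V^*_{\nu^t} \right| < \varepsilon/8. \quad (5)$

Find $k_3$ such that

$h r_{\max} / k_3 < \varepsilon/8. \quad (6)$

Find $k_4$ such that for all $m > k_4$

$\frac{1}{m} d_{\nu^e}(m, \varepsilon/4) < \varepsilon/8, \quad \frac{1}{m} d_{\nu^t}(m, \varepsilon/8) < \varepsilon/8 \quad \text{and} \quad \frac{1}{m} d_{\nu^t}(i_h, \varepsilon/8) < \varepsilon/8. \quad (7)$

Moreover, it is always possible to find $k > \max\{k_1, k_2, k_3, k_4\}$ such that

$\frac{1}{2k} r^{\nu^e}_{k..3k} > \frac{1}{2k} r^{\nu^t}_{k..3k} + \delta. \quad (8)$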

Iterate up to the step $k$.
Exploration. Set $p^e = p^{y_{<n}}_{\nu^e}$. Iterate $h$ steps according to $p = p^e$. Iterate further
until either of the following conditions breaks:

(i) $|r^{\nu^e}_{k..i} - r^{p\nu}_{k..i}| < (i-k)\varepsilon/4 + d_{\nu^e}(k, \varepsilon/4)$,

(ii) $i < 3k$,

(iii) $\nu^e \in T$.

Observe that either (i) or (ii) is necessarily broken.

If on some step $\nu^t$ is excluded from $T$ then the infinite loop is iterated. If after
exploration $\nu^e$ is not in $T$ then redefine $\nu^e$ and iterate the infinite loop. If both
$\nu^t$ and $\nu^e$ are still in $T$ then return to "Prepare for exploration" (otherwise the loop
is iterated with either $\nu^t$ or $\nu^e$ changed).
End of the infinite loop and the algorithm.

Let us show that with probability 1 the "Exploration" part is iterated only a
finite number of times in a row with the same $\nu^t$ and $\nu^e$.

Suppose the contrary, that is, suppose that (with some non-zero probability)
the "Exploration" part is iterated infinitely often while $\nu^t, \nu^e \in T$. Observe that (1)
implies that the $\nu^e$-probability that (i) breaks is not greater than $\varphi_{\nu^e}(i-k, \varepsilon/4)$; hence
by the Borel-Cantelli lemma the event that (i) breaks infinitely often has probability 0
under $\nu^e$.

Suppose that (i) holds almost every time. Then (ii) should be broken except
for a finite number of times. We can use the conditions (4)-(8) on $k$ to show that with
probability at least $1 - \varphi_{\nu^t}(k - i_h, \varepsilon/4)$ under $\nu^t$ we have $\frac{1}{3k} r_{1..3k} > V^*_{\nu^t} + \varepsilon/2$. Again
using the Borel-Cantelli lemma and $k > 2 i_h$ we obtain that the event that (ii) breaks
infinitely often has probability 0 under $\nu^t$.

Thus (at least) one of the environments $\nu^t$ and $\nu^e$ is singular with respect to
the true environment $\nu$ given the described policy and current history. Denote this
environment by $\nu'$. It is known (see e.g. [CS04, Thm. 26]) that if measures $\mu$ and $\nu$
are mutually singular then $\frac{\mu(x_1,\dots,x_n)}{\nu(x_1,\dots,x_n)} \to \infty$ $\mu$-a.s. Thus

$\frac{\nu'(z_{<i})}{\nu(z_{<i})} \to 0 \quad \nu\text{-a.s.} \quad (9)$

Observe that (by definition of $\xi$) $\frac{\nu(z_{<i})}{\xi(z_{<i})}$ is bounded. Hence using (9) we can see that

$\frac{\nu'(z_{<i})}{\xi(z_{<i})} \to 0 \quad \nu\text{-a.s.}$

Since $s$ and $\alpha_s$ are not changed during the exploration phase this implies that on
some step $\nu'$ will be excluded from $T$ according to the "consistency" condition, which
contradicts the assumption. Thus the "Exploration" part is iterated only a finite
number of times in a row with the same $\nu^t$ and $\nu^e$.
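
The convergence in (9) can be observed numerically in the simplest case. The sketch below is only an illustration: it replaces the environments by two i.i.d. coin-flip measures with different biases (which are mutually singular on infinite sequences), samples from the true one, and tracks the logarithm of the likelihood ratio. The function name and parameters are hypothetical.

```python
import math
import random


def log_likelihood_ratio(p_true, p_other, steps, seed=0):
    """log( nu'(z_<i) / nu(z_<i) ) along a sample path drawn from nu.

    nu and nu' are i.i.d. coin-flip measures with biases p_true and p_other.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        x = 1 if rng.random() < p_true else 0
        num = p_other if x else 1 - p_other
        den = p_true if x else 1 - p_true
        total += math.log(num / den)
    return total


log_ratio = log_likelihood_ratio(p_true=0.5, p_other=0.9, steps=2000)
```

The log-ratio drifts to minus infinity at a linear rate, so the ratio itself falls below any fixed threshold $\alpha_s$ after finitely many steps, which is what the consistency check relies on.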

Observe that $s$ is incremented only a finite number of times since $\frac{\nu'(z_{<i})}{\xi(z_{<i})}$ is
bounded away from 0, where $\nu'$ is either the true environment $\nu$ or any environment
from $C$ which is equivalent to $\nu$ on the current history. The latter follows from
the fact that $\frac{\xi(z_{<i})}{\nu(z_{<i})}$ is a submartingale with bounded expectation, and hence, by the
submartingale convergence theorem (see e.g. [Doo53]), converges with $\nu$-probability
1.

Let us show that from some step on $\nu$ (or an environment equivalent to it) is
always in $T$ and selected as $\nu^t$. Consider the environment $\nu^t$ on some step $i$. If
$V^*_{\nu^t} > V^*_\nu$ then $\nu^t$ will be excluded from $T$ since on any sequence of actions (policy)
that is optimal for $\nu^t$ the measures $\nu$ and $\nu^t$ are singular. If $V^*_{\nu^t} < V^*_\nu$ then $\nu^e$ will be equal
to $\nu$ at some point, and, after this happens a sufficient number of times, $\nu^t$ will be
excluded from $T$ by the "exploration" part of the algorithm, $s$ will be incremented
and $\nu$ will be included into $T$. Finally, if $V^*_{\nu^t} = V^*_\nu$ then either the optimal value $V^*_\nu$ is
(asymptotically) attained by the policy $p^t$ of the algorithm, or (if $p_{\nu^t}$ is suboptimal
for $\nu$) $\frac{1}{i} r^{p\nu}_{1..i} < V^*_{\nu^t} - \varepsilon$ infinitely often for some $\varepsilon$, which has probability 0 under $\nu^t$, and
consequently $\nu^t$ is excluded from $T$.

Thus, the exploration part ensures that all environments not equivalent to $\nu$ with
indices smaller than $\iota(\nu)$ are removed from $T$ and so from some step on $\nu^t$ is equal
to (an environment equivalent to) the true environment $\nu$.

We have shown in the "Exploration" part that $n \to \infty$, and so $\varepsilon^\nu_n \to 0$. Finally,
using the same argument as before (Borel-Cantelli lemma, (i) and the definition of
$k$) we can show that in the "exploration" and "prepare for exploration" parts of the
algorithm the average value is within $\varepsilon^\nu_n$ of $V^*_{\nu^t}$ provided the true environment is
(equivalent to) $\nu^t$. □

Proof of Theorem 4. Let $\iota : \mathbb{N} \to C$ be a numbering such that each $\nu \in C$ has
infinitely many indices. Define the measure $\xi$ as in (3). The policy $p$ acts according
to the following algorithm.

Set $\varepsilon_s = \alpha_s = 2^{-s}$ for $s \in \mathbb{N}$, set $j = 1$, $s = n = 1$. The integer $i$ will denote the
current step in time.

Do the following ad infinitum. Set $\nu$ to be the first environment in $C$ with index
greater than $\iota(j)$. Find the policy $p_\nu$ which achieves the upper limiting average
value with probability one (such a policy exists by definition of recoverability). Act
according to $p_\nu$ until either

~ r T..i - v*(p,p> 

% 



(10) 



or 

Increment n, s, i. 

It can be easily seen that one of the conditions necessarily breaks. Indeed, either
in the true environment the optimal upper limiting average value for the current
environment $\nu$ can be achieved by the optimal policy $p_\nu$, in which case (10) breaks;
or it cannot be achieved, which means that $\nu$ and $\xi$ are singular, which implies that
(11) will be broken (see e.g. [CS04, Thm. 26]; cf. the same argument in the proof of
Theorem 3). Since $\nu$ equals the true environment infinitely often and $\varepsilon_n \to 0$ we get
the statement of the theorem. □

Proof of Theorem 6 is analogous to the proofs of Theorems 3 and 4 except for the
following. Instead of the optimal average value $V^*_\nu$ and upper optimal average value
$\overline{V}_\nu$ the values $V^*_\nu(z_{<k})$ and $\overline{V}_\nu(z_{<k})$ should be used, and they should be updated
after each step $k$. □